A Brief Introduction to Clustering and Community Detection Methods Using NetworkX

Author: Erika Fille Legara


The MIT License (MIT)

Copyright (c) 2016 Erika Fille Legara

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Description

Real-world networks reveal clustering behaviour, which is exhibited in the formation of communities/clusters/partitions (terms are used interchangeably in this notebook) in the graph structure. When studying networks, their structure and function, it is crucial to identify these bunchings. In this recipe, we explore these different structures (including cliques and connected components) with a strong focus on clustering and community detection. In particular, we explore two general methods, and three submethods:

  1. Divisive Method
    • Girvan-Newman algorithm using edge betweenness (Section 3)
  2. Agglomerative Method
    • Ravasz (hierarchical clustering) algorithm using distance matrix (Section 4)
    • Louvain method using modularity (Section 5)

Note: This notebook is a supplementary material to Michael Lees and Debraj Roy's A Short Practical Introduction to NetworkX for the 2016 NTU Winter School on Complex Systems and also to the presentation on "Community Structures" by the author.

Let's Dig-in

Import all necessary packages for this notebook to run smoothly. If you are encountering errors in importing some of the packages, please check the 0. Introduction and Setup notebook by Michael and Debraj.


In [1]:
try:
    ## For Network Analysis and Visualization
    import networkx as nx
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    from collections import defaultdict
    import operator

    ## For Hierarchical Clustering
    from scipy.cluster import hierarchy
    from scipy.spatial import distance
    
    ## For Community Detection (Louvain Method)
    import community
    
except:
    import traceback
    traceback.print_exc()
    raise ImportError('Something failed, see above.')

1. Load the Network

Here, we use the Zachary's karate club network $K$ to illustrate some concepts on network structures. This network was first presented in [1].

[1] W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977).


In [3]:
K = nx.karate_club_graph()

2. Network Structure and Properties

  1. Clique
  2. Connected Components
  3. Edge Betweenness

In [4]:
print "Number of nodes: ", K.size()
print "Number of edges: ", K.order()


Number of nodes:  78
Number of edges:  34

In [5]:
pos = nx.fruchterman_reingold_layout(K);

In [6]:
plt.figure(figsize=(8,8));
plt.axis("off");
nx.draw_networkx_nodes(K, pos, node_size=300, node_color="black");
nx.draw_networkx_edges(K, pos, alpha=0.500);
nx.draw_networkx_labels(K, pos, font_color="white");
plt.show();


Real-world partition


In [7]:
mrhi = [0,1,2,3,4,5,6,7,8,10,11,12,13,16,17,19,21]
johna = [9,14,15,18,20,22,23,24,25,26,27,28,29,30,31,32,33]

for mem in K.nodes():
    if mem in mrhi:
        K.node[mem]["group"] = "mrhi"
    else:
        K.node[mem]["group"] = "johna"

In [8]:
plt.figure(figsize=(8,8));
plt.axis("off");
nx.draw_networkx_nodes(K,pos,
                       nodelist=mrhi,
                       node_color='r',
                       node_size=300,
                   alpha=0.8);
nx.draw_networkx_nodes(K,pos,
                       nodelist=johna,
                       node_color='b',
                       node_size=300,
                   alpha=0.8);
nx.draw_networkx_edges(K, pos, alpha=0.5);
nx.draw_networkx_labels(K, pos, font_color="white");


Cliques

Cliques are complete subgraphs; that is, subgraphs where the nodes are all directly connected to each other. For example, using the Karate Club network $K$, the cliques are listed below. Note that any two connected nodes in an undirected network are considered to form a clique.


In [9]:
print list(nx.find_cliques(K))


[[0, 1, 17], [0, 1, 2, 3, 13], [0, 1, 2, 3, 7], [0, 1, 19], [0, 1, 21], [0, 4, 10], [0, 4, 6], [0, 5, 10], [0, 5, 6], [0, 8, 2], [0, 11], [0, 12, 3], [0, 31], [1, 30], [2, 32, 8], [2, 9], [2, 27], [2, 28], [5, 16, 6], [24, 25, 31], [24, 27], [25, 23], [33, 32, 8, 30], [33, 32, 14], [33, 32, 15], [33, 32, 18], [33, 32, 20], [33, 32, 22], [33, 32, 23, 29], [33, 32, 31], [33, 9], [33, 13], [33, 19], [33, 26, 29], [33, 27, 23], [33, 28, 31]]

3. Divisive Method: The Girvan-Newman Algorithm

Edge Betweenness

The betweenness of an edge $e$ is the fraction of shortest-paths that course through it.

$$c_B(e)=\sum _{i,j \in V}\frac{\sigma (i,j)|e}{\sigma (i,j)},$$

where $\sigma(i,j)$ is the total number of shortest paths and $\sigma (i,j)|e$ is the number of shortest paths that pass through edge $e$. In networkx, the function is edge_betweenness_centrality(G). Below, we illustrate the function using Zachary's karate club $K$.


In [10]:
#order the dictionary by value, which is the edge betweeness of two nodes
ebet = nx.edge_betweenness_centrality(K)
sorted_ebet = sorted(ebet.items(), key=operator.itemgetter(1), reverse=True)
sorted_ebet[0:5]


Out[10]:
[((0, 31), 0.12725999490705373),
 ((0, 6), 0.07813428401663694),
 ((0, 5), 0.07813428401663694),
 ((0, 2), 0.0777876807288572),
 ((0, 8), 0.07423959482783014)]

Iteratively remove edges with the highest edge betweenness. For purpose of illustration, we first make a copy of the karate club network $K$. We then perform $i=10$ iterations of edge removal.


In [11]:
K2 = K.copy()

def remove_top_ebet(K2):
    ebet = nx.edge_betweenness_centrality(K2)
    sorted_ebet = sorted(ebet.items(), key=operator.itemgetter(1), reverse=True)
    edge_to_remove = sorted_ebet[0]
    K2.remove_edge(*edge_to_remove[0])

    return K2

for i in range(10):
    K2=remove_top_ebet(K2)

In [12]:
nx.draw(K2, pos)
nx.draw_networkx_labels(K2, pos)
plt.show()



In [13]:
for i in range(5):
    K2=remove_top_ebet(K2)
    
nx.draw(K2, pos)
nx.draw_networkx_labels(K2, pos)
plt.show()


The algorithm for finding "communities" based on edge betweenness removal was presented by Girvan-Newmann in their 2002 PNAS paper. Based on the method implemented, the algorithm is classified as a dividive procedure (as opposed to agglomerative, which is discussed in the next section. A code snippet taken here is shown below.

Connected Components

Connected components are subgraphs where all the nodes are connected to each via a path. Unlike cliques, they don't have to be directly connected. Moreover, the Giant Component refers to the network's largest component.


In [14]:
print "Number of components: ", len(list(nx.connected_components(K2))),"\n"
print list(nx.connected_components(K2))


Number of components:  3 

[set([0, 1, 3, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17, 19, 21]), set([32, 33, 2, 8, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]), set([9])]

4. Agglomerative Method: Hierarchical Clustering

In this method, we start with a distance matrix $D$, which is a matrix that provides the distance between two nodes $i$ and $j$ in network $G$. Using $D$, we then perform hierarchical clustering in a recursive manner, group similar nodes at each step. There are various ways to perform the similarity grouping using agglomerative procedures such as single, complete, and average, to name a few. In the example below, we implement the average/UPGMA linkage, which is the one implemented in the Ravasz Algorithm published in Science in 2002.

In the following cells, we perform hierarchical clustering and visualize the order in which the nodes are grouped together in communities.


In [15]:
__author__ = """\n""".join(['Erika Fille Legara <legareft@ihpc.a-star.edu.sg>',
                            'Maksim Tsvetovat <maksim@tsvetovat.org',
                            'Drew Conway <drew.conway@nyu.edu>',
                            'Aric Hagberg <hagberg@lanl.gov>'])
    
'''
The original code was written by Drew Conway and Aric Hagberg and was later 
modified by Maksim Tsvetovat for his book titled “Social Network Analysis for Startups”.
EF Legara slightly modified Tsvetovat's version to address some issues 
on the ordering of keys within the path_length dictionary. Moreover,
in this code, a different agglomerative procedure is implemented.
'''

def create_hc(G, t):
    ## Set-up the distance matrix D
    labels=G.nodes()    # keep node labels
    path_length=nx.all_pairs_shortest_path_length(G)
    distances=np.zeros((len(G),len(G)))   
    for u,p in path_length.items():
        for v,d in p.items():
            distances[G.nodes().index(u)][G.nodes().index(v)] = d
            distances[G.nodes().index(v)][G.nodes().index(u)] = d
            if u==v: distances[G.nodes().index(u)][G.nodes().index(u)]=0

    # Create hierarchical cluster (HC)
    # There are various other routines for agglomerative clustering, 
    # but here we create the HCs using the complete/max/farthest point linkage 
    Y = distance.squareform(distances) ## the upper triangular of the distance matrix 
    Z = hierarchy.average(Y)
    
    # This partition selection (t) is arbitrary, for illustrive purposes
    membership=list(hierarchy.fcluster(Z,t=t))
    
    # Create collection of lists for blockmodel
    partition = defaultdict(list)
    for n,p in zip(list(range(len(G))),membership):
        partition[p].append(labels[n])
        
    return Z, membership, partition

In [16]:
Z, membership, partition = create_hc(K, t=1.15) 
partition.items()


Out[16]:
[(1, [4, 6]),
 (2, [5, 10]),
 (3, [16]),
 (4, [0, 1, 2, 3, 7, 11, 12, 13, 17, 19, 21]),
 (5, [28, 31]),
 (6, [23, 25]),
 (7, [24, 27]),
 (8, [8, 9, 14, 15, 18, 20, 22, 26, 29, 30, 32, 33])]

In [17]:
partition = {}
i = 0
for i in range(len(membership)):
    partition[i]=membership[i]

In [18]:
plt.figure(figsize=(10,10))
plt.axis('off')

nx.draw_networkx_nodes(K, pos, cmap=plt.cm.RdYlBu, node_color=partition.values())
nx.draw_networkx_edges(K, pos, alpha=0.5)
nx.draw_networkx_labels(K, pos)
plt.show()


Dendrogram

The dendrogram allows us to visualize the order in which nodes are grouped together in the clustering process.


In [19]:
plt.figure(figsize=(10,8))
hierarchy.dendrogram(Z)
plt.show()


5. Modularity-Based Community Detection: The Louvain Method

The community detection algorithm that we implement in this notebook is a modularity-based algorithm. The modularity $M_c$ quantifies how good a "community" or partition is, and is given by

\begin{equation} M_c = \sum _{c=1} ^{n_c} \left[ \frac{L_c}{L} - \left (\frac{k_c}{2L} \right)^2\right] \end{equation}

where $n_c$ is the number of communiies, $L_c$ is the number of links within the community, $k_c$ is the total degree of nodes belonging to the community, and $L$ is the total number of links in the network. The illustration below that is taken from the slide deck shows how modularity is calculated.

The higher the network modularity is, the more optimal the partitionioning. And, if you only have a single community, $M=0$.

In this recipe, we implement a modularity-maximization algorithm called the Louvain algorithm. We go back to Zachary's karate club and find the best partitions.


In [21]:
partition = community.best_partition(K)

Now, let's redraw the network; this time, highlighting the different communities by node color.


In [22]:
plt.figure(figsize=(10,10))
plt.axis('off')

nx.draw_networkx_nodes(K, pos, cmap=plt.cm.RdYlBu, node_color=partition.values())
nx.draw_networkx_edges(K, pos, alpha=0.5)
nx.draw_networkx_labels(K, pos)
plt.show()


The algorithm produces four (4) partitions. Let's compare this partition to the actual.


In [23]:
plt.figure(figsize=(8,8));
plt.axis("off");
nx.draw_networkx_nodes(K,pos, nodelist=mrhi, node_color='r',
                       node_size=300, alpha=0.8);
nx.draw_networkx_nodes(K,pos, nodelist=johna, node_color='b', 
                       node_size=300, alpha=0.8);
nx.draw_networkx_edges(K, pos, alpha=0.5);
nx.draw_networkx_labels(K, pos, font_color="white");


6. More Real-World Networks

Sample 1: The Marvel Universe

There are two types of Marvel Universe networks that is available on the web, the Comic Network and the Hero Network. The Comic Network connects the characters to the commic issues they appeared in; on ther hand, the Hero Network represents the connections between characters as they appear in the same comic issue (weighted). In this notebook, we look at the Hero Network.


In [24]:
import unicodecsv as csv

M = nx.Graph(name="Marvel Universe")
reader = csv.reader(open("./Module 6 Datasets/hero-network.csv", 'rU'))
for row in reader:
    M.add_edge(*row)

To export this network for use in Gephi, use

nx.write_gexf(M, "marvel_universe.gexf")

Graph Statistics


In [25]:
M.size() #number of edges


Out[25]:
167219

In [26]:
M.order() #number of nodes


Out[26]:
6426

In [27]:
M_cliques=list(nx.find_cliques(M))

In [28]:
print M_cliques[0:2]


[[u'CRUSADER II/ARTHUR B', u'HELA [ASGARDIAN]', u'HOGUN [ASGARDIAN]', u'ODIN [ASGARDIAN]', u'HEIMDALL [ASGARDIAN]', u'THOR/DR. DONALD BLAK', u'SIF', u'FANDRAL [ASGARDIAN]'], [u'CRUSADER II/ARTHUR B', u'BLACK KNIGHT V/DANE ', u'THOR/DR. DONALD BLAK', u'X-MAN/NATHAN GREY ']]

Finding Communities


In [29]:
partition = community.best_partition(M)

In [30]:
len(set(partition.values()))


Out[30]:
25

The code below may take a while to run since we are looking at a relatively large network, so we're commenting it out.'

plt.figure(figsize=(10,10))
plt.axis('off')

pos = nx.spring_layout(M)
nx.draw_networkx_nodes(M, pos,cmap=plt.cm.RdYlBu, node_color=partition.values())
nx.draw_networkx_edges(M, pos, alpha=0.4)
plt.savefig("Marvel Universe.png")

But this is how it looks using Gephi.

What are these communities? Which community is the largest? Sneak a look into the communities.


In [31]:
comm = defaultdict(list)
for k, v in partition.items():
    comm[v].append(k)

In [32]:
print comm.items()[11]


(11, [u'RYKER, MAJOR SIMON', u'NOX', u'BLACK, CYRUS', u'BURR, CLAYTON', u'MANDARIN, JAMES', u'STERN', u'LITTLETREE, LINDA', u'PENNINGTON, AUDRA', u'HASTINGS, PROF. LOUI', u'STERNBERG, DR. BERTH', u'DWORMAN, JIM', u'NANABOZHO', u'HAMIR', u'DELGADO, LUZ', u'CLOUD', u'WONG', u'SLAVINSKI, SALLY', u'RIOT/HEIDI SLATKIN', u'ASMODEUS', u'CYTTORAK', u'CAPTAIN UNIVERSE/STE', u'TROLL/', u'DRAKE, FRANKLIN', u'OVERRIDER/', u'KREA, BRUNO', u'HANSEN, BILLY', u'ROMMEL, EDWIN ROMULU', u'BLAZE, ROXANNE SIMPS', u'WEISENTHAL, DR. JONA', u'LOHAN, PETER', u'CELESTE', u'GRIFFIN', u'LANGFORD, DR. DAVID', u'FEAR EATER/KKALLAKKU', u'WOLF', u'DOC', u'INTERLOPER', u'HU, DR.', u'DEATHWATCH/STEPHAN L', u'JACOBS, BEN', u'DR. STRANGER YET', u'DEIMOS', u'BRUKNER, DR. FRANK', u'NIGHTMARE/EDVARD HAB', u'BLAZE, CRAIG', u'UMAR', u'NAVARRO', u'SIFFROR', u'SALTZ/', u'YGMLOSVORRAICH', u'FLORESCU, ANTON', u'MAINFRAME/IAN WAJLER', u'HELLSTROM, VICTORIA', u'SEWARD, ANNE', u'CHALMERS, MR.', u'KETCH, BARBARA', u'NICODEMUS', u'RYKER, HARLAN', u'NEVILLE, KATE', u'BENTLEY, VICTORIA', u'AGGAMON', u'URTHONA', u'HAG/', u'LEVIATHAN IV', u'STRANGE', u'DEATHLOK/COLONEL LUT', u'WOODS, MIRANDA', u'WOLFE, SARA', u'SPIKE', u'GREEN, ARNIE', u'COLLINS, TRACY', u'SISTER NIL', u'NAKOTA', u'HOTAMITANIO', u'STYGE, REV.', u'RIVERA, MS.', u'PSI-BORG/', u'RAHL', u'MEPHISTA', u'MALICE II', u"D'KAY", u'SOULFIRE/JASON MALDO', u'SILVER DAGGER/ISAIAH', u'BADALAMENTE, JESUS', u'CALUMET', u'BLESSING, MORGANA', u'ANDROMEDA/ANDROMEDA ', u'HEDRON', u'ZOM', u'MORDO, ASTRID', u'BELIAL', u'LURKING UNKNOWN', u'BLACKOUT II/', u'PARADOX', u'MOTHER-OF-PAIN', u'ZOMBIE/SIMON GARTH', u'YAYAUHCO', u'SHUMA-GORATH', u'TRAVERS, MIKE', u'PIXIE II/', u'STRONTIUM-90/', u'AGAMOTTO', u'MR. FEAR/ZOLTAN DRAG', u'LIN, MEI', u'MACKENZIE, AL', u'DELAZNY, CHARLES L.', u'DR. STRANGE/STEPHEN ', u'COLLINS, NICK', u'ASSAULT/', u'MORRISON, JESSICA', u'SATHAN', u'SIMPSON, CRASH', u'CREED', u'ANGSTROM, EARL', u'FURMAN, RICHARD', u'SATANNISH', u'SCARECROW II', u'RAZORBLADE II/', u"D'SPRYTE", u'LEGION', u'IVORY/', u'NAMELESS ONE II', u'VIOLENCE/', u'GODWULF', u'SILVER FOX/BAYAN', u'WENDIGO IV/MICHAEL F', u'SATANA', u'KALUU', u'DICE/ROB FORTUNE', u'MOONSTOMP/', u'PHEMOUS', u'BARTON, SEN.', u'NUGENT', u'BUCHANAN, SAM', u'GUILOTINE', u'RAKESH', u'ARMAND', u'KHAT', u'GHOST RIDER II/JOHNN', u'GREEN, LILA', u'WILDPRIDE/KYLLIAN', u'AGED GENGHIS', u'STEVENS, DR. VINCENT', u'IRONWOOD, CISSY', u'VESPER', u'FLEET, JONAS', u'HARKER, QUINCY', u'STAR THIEF II', u'WEI, LINDA', u'DWELLER IN DARKNESS', u'WOODS, TIMOTHY', u'ROULETTE/DUDE JENSEN', u'KLEB, ROSA', u'MAUVIUS', u'DEATHLOK II/COL. JOH', u'SEWARD, DR. CHARLES', u'REYNOLDS, KATHERINE', u'CARETAKER', u'RILEY, MARIEL', u'IKONN', u'BERTRAM', u'FANG III', u'KELLY, KAREN', u'MEAGHER, TYLER', u'PILGRIM', u'SIEGEL, DR. AVRAM', u'PAPA NEBO', u'PYRE', u'RIESER, ROB', u'RAGGADORR', u"D'AURIA, JACK", u'DENAK', u'ORINI', u'MORRISON, AMANDA PAY', u'APPALLA', u'POWDERKEG/FRANK SKOR', u"DEATH'S HEAD", u'CROSS, STANLEY', u'SNOWBLIND/', u'MITCHELL, ROBERT', u'TOFIELD, FRANCESCA', u'BARON BLOOD II/VICTO', u'SUICIDE/CHRIS DANIEL', u'MONTESI, VICTORIA', u'PIERCE, ALEXANDER GO', u'PAIN', u'KLEIN, SUMMER', u'KALE, NAOMI', u'QUENTIN, RALPH', u'LORD LEADER', u"D'SPAYRE", u'RYDER', u'BADALINO, LT. MICHAE', u'MEATMARKET', u'CLEA', u'LEVIATHAN V', u'SEWARD, JANE', u'SCHEELE, STEPHANIE', u'KELLY, SARA', u'DAKINI', u'WHITE, JONATHAN', u'KETCH, DORIS AUDRA', u'DORMAMMU', u'KIMBLE, DR.', u"LILITH/ANGEL O'HARA/", u'TAKIMOTO, RON', u'COFFIN, RAY', u'SEWARD, ELISABETH', u'GARTH, DONNA', u'BATTERY/', u'TERREL, DEBORAH', u'BELLE', u'FUANARO, RENEE', u'ELECTRA', u'LUDI', u'BARON STRUCKER/WOLFG', u'KING, HANNIBAL/HENRY', u'EXECUTIONER V', u'THAMES, SERIPHA', u'HOBBS, DET.', u'MIDWIFE', u'MCKENNA-DRAKE, MARLE', u'DOLAN, STACY', u'MOLGOTHA', u'DRAGONUS', u'SHAW, OBADIAH', u'THOG', u'LOBO II', u'ZODIAK/NORMAN HARRIS', u'MUNIPOOR', u'NETWORK NINA', u'BLADE', u'ADRIA', u'INNARDS', u'AZRAEL', u'MALEVOLENCE', u'WILMOTT, DR. KEITH', u'GREASE', u'NOBLE', u'RAMA KALIPH', u'KNOCKABOUT/', u'VALTORR', u'GALTUS', u'MANNING, JANICE', u"MIND'S EYE/MEACHAM", u'SLICK', u'GORN', u'BARON MORDO/KARL MOR', u'MALTZ', u'LILITH II', u'NINGAL', u'OWAYODATA', u"N'GAMI", u'FISHMAN, DR. JONATHA', u'GLEEG', u'CHANG, IMEI', u'BACILLUS', u'DRAGON OF THE MOON', u'MENNINGER, CLARA', u'HELLSTORM DOPPELGANG', u'BLAZE, EMMA', u'INFERNO', u'GHOST RIDER III/DAN ', u'STRANG, COL./LT. GAR', u'SHAITAN, AL', u'STILES, TATJANA', u'SKINNER', u'HOGGOTH', u'MALPRACTICE', u'MCCORMICK, FIONA', u'VITCHEN', u'ANCIENT ONE', u'BIOHAZARD', u'SPETSBURO, GEN. YURI', u'BRASS/SEAN WATANABE', u'AFTERLIFE', u'SHADOW QUEEN/SHIALMA', u'HARKNESS, ABIGAIL', u'DOGGEREL', u'GRIEFSAVOR', u'VAN HELSING, RACHEL', u'SKARABROUS', u'JASMINE', u'DRUMM, COLLETTE', u'MANSLAUGHTER', u'WATOOMB', u'FLEET, TAMARA', u'WATANABE, EMMALINE B', u'ROTWRAP', u'WATANABE, YUJI', u'FOWLER, RED', u'OSHTUR', u'KUROV, WASSILY', u'TREVANI, DANIKA', u'AZOPARDI', u'PHOBOS', u'CAULDER, SAFRON', u'RILEY, JACK', u'TOPAZ', u'RINTRAH', u'LUMP', u'SHAW, HIRAM', u'SHAW, SARAH', u'LUPE', u'FROST, DEACON', u'SILVERCLOUD, SAM', u'MORAN, KEVIN', u'SMOKESCREAM/', u'SALOME', u'SHIVER MAN', u'CATWALK/KIMBERLY DEE', u'SPITE/', u'VILIVEN', u'DOLAN, CAPT. THOMAS', u'RILEY, KIM', u'BRUTUS KLOR', u'LAVEAU, MARIE', u'SATAN/MARDU KURIOS', u'SCUZZ'])

Sample 2: Word Co-occurrence Network

In this example, we look at the word co-occurrence dataset used in the paper titled "News Framing of Population and Family Planning Issues via Syntactic Network Analysis" by E.F. Legara et al. Using a community detection algorithm, we looked for word patterns (how the words were 'arranged' or connected in the sentences of each news article) to evaluate how the media framed the population issue in the Philippines through the use of the different labels to refer to it in public discourse.


In [33]:
F = nx.read_gml("./Module 6 Datasets/word-net.gml", label="label")

In [34]:
# nx.write_gexf(F, "framing.gexf")

In [35]:
partition = community.best_partition(F, weight="weight")

Commenting this out because the network is so dense, it doesn't say much when plotted.

plt.figure(figsize=(10,10))
plt.axis('off')

pos = nx.spring_layout(F, iterations=150)
nx.draw_networkx_nodes(F, pos,cmap=plt.cm.RdYlBu, node_color=partition.values(), node_size=10, label=True)
nx.draw_networkx_edges(F, pos, alpha=0.4)
plt.savefig("Frames.png")

Instead, let's view the communities by printing out the words.


In [36]:
coms = defaultdict(list)
for k in partition.keys():
    coms[partition[k]].append(k)

In [37]:
print coms[0], "\n", coms[1], "\n", coms[2], "\n", coms[3], "\n", coms[4]


[u'catholic_church', u'code', u'protest', u'population_management', u'violate', u'passage', u'biazon', u'aniceto', u'committee', u'culture', u'passe', u'constitution', u'state', u'international', u'approach', u'parent', u'opposition', u'group', u'strong', u'arroyo', u'roman', u'ippf', u'respect', u'remulla', u'conference', u'penalty', u'raise', u'atienza', u'manongdo', u'life', u'warn', u'challenge', u'bill', u'abortion', u'advocacy', u'fvr', u'worry', u'federation', u'prolife', u'administration', u'cardinal', u'manila', u'sin', u'wrong', u'value', u'president', u'law', u'reodica', u'population_control', u'position', u'hierarchy', u'clinton', u'jose', u'police', u'family', u'pro-abortion', u'department_of_energy', u'responsible_parenthood', u'human_rights', u'propose', u'bacani', u'provision', u'dignity', u'success', u'continue', u'conflict', u'moral', u'right', u'senate', u'house', u'society', u'lead', u'religion', u'pampanga', u'leader', u'belief', u'effect', u'pressure', u'district', u'freedom', u'beijing', u'macapagal', u'act', u'block', u'tatad', u'little', u'commission', u'divorce', u'support', u'legislation', u'fight', u'priority', u'human', u'regard', u'threaten', u'oppose', u'declare', u'election', u'influence', u'politics', u'lie', u'responsible', u'plcpd', u'urge', u'macapagal-arroyo', u'francisco', u'congress', u'pgma', u'acosta', u'wife', u'debate', u'representative'] 
[u'consider', u'fuente', u'focus', u'controversial', u'impact', u'explain', u'sex_education', u'charge', u'woman', u'risk', u'department_of_education', u'induce', u'school', u'attitude', u'common', u'activity', u'fertility', u'husband', u'prevent', u'force', u'sex', u'college', u'professional', u'&#1089;', u'expert', u'review', u'safe_sex', u'cordillera', u'public', u'promiscuity', u'active', u'protect', u'engage', u'clarify', u'institute', u'experience', u'love', u'danger', u'pregnancy', u'union', u'homosexual', u'unsafe', u'expose', u'adult', u'demand', u'misconception', u'science', u'behavior', u'aid', u'male', u'guide', u'claim', u'hiv_aids', u'partner', u'media', u'necessary', u'status', u'student', u'older', u'uteru', u'industry', u'responsibility', u'talk', u'lapu', u'raymundo', u'course', u'public_school', u'majority', u'labor', u'practice', u'earn', u'early', u'test', u'know', u'desire', u'difficult', u'sexually', u'home', u'girl', u'special', u'unwanted_pregnancies', u'world_health_organization', u'research', u'formula', u'high_school', u'hinder', u'marriage', u'complication', u'unwanted', u'ask', u'filipino', u'abstinence', u'consequence', u'relate', u'sexually_transmitted_disease', u'female', u'hidalgo', u'sex_worker', u'percentage', u'relationship', u'premarital_sex', u'youth', u'limit', u'few', u'education', u'employment', u'department', u'concern', u'transmit', u'leave', u'important', u'wive', u'class', u'cause', u'disease', u'face', u'traditional', u'fact'] 
[u'sector', u'global', u'asian', u'follow', u'fertility_rate', u'cebu', u'proportion', u'rice', u'mortality', u'division', u'regional', u'local_govt_units', u'fall', u'affect', u'population_growth', u'level', u'large', u'united_nations', u'release', u'trend', u'result', u'birth_rates', u'project', u'davao', u'future', u'capital', u'muslim', u'southeast', u'slow', u'china', u'address', u'blame', u'visaya', u'nscb', u'poverty', u'samar', u'projection', u'win', u'manage', u'period', u'decrease', u'debt', u'mark', u'change', u'achieve', u'figure', u'attribute', u'household', u'high', u'erap', u'drive', u'provincial', u'goal', u'afp', u'economy', u'rise', u'maternal', u'different', u'produce', u'data', u'grow', u'expand', u'statistical', u'death', u'national_statistics_office', u'demographic', u'policy', u'nation', u'birth', u'censu', u'die', u'development', u'determine', u'people', u'reduce', u'elderly', u'economic', u'expect', u'osia', u'fusilero', u'officer', u'decline', u'base', u'thailand', u'post', u'quezon', u'actual', u'sustain', u'toma', u'major', u'economic_growth', u'rank', u'low', u'estimate', u'gain', u'population', u'develop', u'official', u'up', u'problem', u'united_nations_population_fund', u'migration', u'contribute', u'alarm', u'curb', u'authority', u'mindanao', u'duque', u'income', u'poor', u'thi', u'incidence', u'reach', u'significant', u'commission_on_population', u'average', u'statistic', u'pangasinan'] 
[u'livelihood', u'relation', u'welfare', u'children', u'biliran', u'environment', u'health', u'local', u'condition', u'gender', u'cost', u'health_professional', u'educate', u'capacity', u'access', u'available', u'importance', u'empower', u'deliver', u'water', u'improve', u'social', u'community', u'public_official', u'village', u'illegal', u'hospital', u'management', u'basic', u'citizen', u'aware', u'medical', u'process', u'attain', u'technology', u'needs', u'grant', u'provide', u'sustainable', u'maternal_health', u'date', u'stress', u'center', u'inform', u'mother', u'help', u'clinic', u'suffer', u'personal', u'better', u'food', u'resource', u'reproductive_health', u'benefit', u'security', u'comprehensive', u'patient', u'care', u'basis', u'lacson', u'facility', u'appropriate', u'quality', u'service', u'system', u'nutrition', u'serve', u'counsel', u'reproductive', u'urban', u'recognize', u'midwive', u'makati', u'rural'] 
[u'fear', u'calendar_method', u'vasectomy', u'hope', u'program', u'condom', u'choice', u'communication', u'modern', u'remove', u'method', u'radio', u'enforce', u'favor', u'donate', u'movement', u'family_planning', u'foreign', u'approve', u'campaign', u'trust', u'market', u'use', u'prove', u'birth_control', u'firm', u'dkt', u'fund', u'budget', u'department_of_health', u'suppose', u'control', u'prevalence', u'accept', u'flavier', u'juan', u'usaid', u'non-government', u'stop', u'natural', u'effective', u'allow', u'fail', u'order', u'breastfeeding', u'money', u'distribute', u'flavor', u'supply', u'pill', u'government', u'couple', u'intrauterine_device', u'crisostomo', u'kintanar', u'artificial', u'proper', u'space', u'legal', u'encourage', u'safety', u'dayrit', u'withdrawal', u'free', u'intense', u'perform', u'buy', u'ligation', u'promote', u'actively', u'fertilize', u'contraception', u'ban', u'scientific', u'intend', u'device', u'adopt', u'safe', u'push']

Sample 3: Political Blogs

Here, we use the dataset utilized in the paper titled "The political blogosphere and the 2004 US Election" by L. A. Adamic and N. Glance (2005). It's a directed network of hyperlinks between weblogs on US politics.


In [38]:
# P = nx.read_gml("./Module 6 Datasets/polblogs.gml")
P = nx.read_gml("./Module 6 Datasets/polblogs copy.gml")

Some Network Properties


In [39]:
P.nodes(data=True)[0:2]


Out[39]:
[(u'tsrightdominion.blogspot.com', {u'source': u'Blogarama', u'value': 1}),
 (u'rightrainbow.com', {u'source': u'Blogarama', u'value': 1})]

In [40]:
P.order(), P.size()


Out[40]:
(1490, 19090)

In [41]:
pos = nx.spring_layout(P)
nx.draw(P, pos, node_size=30)
plt.show()


Finding Communities

To perform the community detection algorithm, the directed graph needs to be made into an undirected one.


In [42]:
PR = P.to_undirected()
PR = nx.Graph(PR)

In [43]:
mypalette = ["blue","red","green", "yellow", "orange", "violet", "grey", "grey","grey"]

In [44]:
pos = nx.spring_layout(PR)
#colors = [mypalette[PR.node[i]['value']] for i in range(1,len(PR.nodes()))]
colors = [mypalette[PR.node[i]['value']] for i in PR.nodes()]
nx.draw(PR, pos, node_color=colors, node_size=10)
plt.show()


Extract Giant Component

In [45]:
Gcc=sorted(nx.connected_component_subgraphs(PR), key = len, reverse=True)
GC=Gcc[0]

In [46]:
partition = community.best_partition(GC)

In [47]:
print "Number of Communities: ", len(set(partition.values()))


Number of Communities:  9

In [48]:
for k, v in partition.items():
    GC.node[k]["louvain-val"] = v
colors = [mypalette[GC.node[node]["louvain-val"]] for node in GC.nodes()]

In [49]:
plt.figure(figsize=(10,10))
plt.axis('off')
pos = nx.spring_layout(GC, scale=3)
nx.draw_networkx_nodes(GC, pos, node_color=colors, node_size=20, label=True)
nx.draw_networkx_edges(GC, pos, alpha=0.4)
plt.savefig("polblogs.png")


Sample 4: Your Own Network Data

Load the Network


In [ ]:

Network Properties and Structures


In [ ]:

Relevant References

  • Network Science by Albert-László Barabási (2015)
  • Social Network Analysis for Startups: Finding connections on the social web by Maksim Tsvetovat and Alexander Kouznetsov (2011)
  • Practical Data Science Cookbook by Tony Ojeda et al. (2014)

Some Network Data Repositories